Mean of a population: $\quad \mu = \frac{\sum_i^N x_i}{N}$
Variance of a population: $\quad \sigma^2 = \frac{\sum_i^N{(x_i - \mu)^2}}{N}$
Standard deviation of a population: $\quad \sigma = \sqrt{\frac{\sum_i^N{(x_i - \mu)^2}}{N}}$
Number in Population: $\quad N$
For gaussian distributions, the mean and standard deviation completely describe the distribution.
A frequent pitfall occurs when reporting a mean and standard deviation for a non-gaussian distribution. When a distribution is not gaussian, the standard deviation can still be calculated and reported, but the meaning is easily misinterpreted.
Consider the example from reference [1] in which certain 1976 medical articles had an average of 4.9 authors $\pm7.3$ (SD). At face value, this indicates that 95% of articles had $4.9\pm(1.96\times7.3)$ authors, or from $-9.4$ to $+19.2$; or that 25% of articles had zero or fewer authors.
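The arithmetic behind this example can be checked directly. The sketch below assumes, as the naive reading does, that the author counts follow a gaussian distribution; the helper `normal_cdf` is a name introduced here for illustration only.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a gaussian distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

mean_authors = 4.9  # reported mean
sd_authors = 7.3    # reported standard deviation

# Naive 95% range implied by treating the data as gaussian
low = mean_authors - 1.96 * sd_authors
high = mean_authors + 1.96 * sd_authors

# Fraction of articles the gaussian reading places at zero or fewer authors
frac_nonpositive = normal_cdf(0.0, mean_authors, sd_authors)

print('naive 95% range: {:0.1f} to {:0.1f} authors'.format(low, high))
print('implied fraction with <= 0 authors: {:0.2f}'.format(frac_nonpositive))
```

Both results are impossible for a count of authors, which is the point of the example.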
In this situation, a better description would be provided using the mean and range, or quartile ranges.
The mean and standard deviation are used to describe the distribution of a population of measurements and to estimate the distribution of a population based on a sample of measurements.
This difference means there are two slight changes in the calculation of the standard deviation for the second case. First, instead of the mean being the exact population mean, $\mu$, the mean is an estimate of the larger population mean, $\bar{x}$, though it is calculated the same way. The second difference, which actually changes the result, is that the sum of the squared differences from the mean is divided by $n-1$ rather than $N$. This change produces a wider distribution, corresponding to the sample underestimating the full spread of values present in the entire population.
Estimate of the population mean: $\quad \bar{x} = \frac{\sum_i^n x_i}{n}$
Estimate of the population standard deviation from a sample: Typical or average difference between the data points and their mean. $$\quad SD = \sqrt{\frac{\sum_i^n{(x_i - \bar{x})^2}}{n-1}}$$
Number in Sample: $\quad n$
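As a minimal sketch of the two calculations, the example below evaluates both formulas on a small set of made-up values (chosen only so the arithmetic is easy to follow) and compares them with NumPy's `np.std`, where the `ddof` argument selects between dividing by $N$ (`ddof=0`, the default) and by $n-1$ (`ddof=1`).

```python
import numpy as np

# Hypothetical measurements, for illustration only
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)
x_bar = x.sum() / n

# Population formula: divide the summed squared differences by N
sd_population = np.sqrt(((x - x_bar) ** 2).sum() / n)
# Sample estimate: divide by n - 1 instead
sd_sample = np.sqrt(((x - x_bar) ** 2).sum() / (n - 1))

print('population SD: {:0.3f} (np.std ddof=0: {:0.3f})'.format(
    sd_population, np.std(x, ddof=0)))
print('sample SD    : {:0.3f} (np.std ddof=1: {:0.3f})'.format(
    sd_sample, np.std(x, ddof=1)))
```

The sample estimate is always the larger of the two, and the gap shrinks as $n$ grows.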
While there is only a slight difference in calculating these two standard deviations, there is a real difference in their meaning. Unfortunately it is rare to see a distinction made when reporting a standard deviation value.
The standard error is used to estimate a confidence interval within which the true value should lie; it does not describe the distribution of values (it also has nothing to do with standards or errors). Whenever it is used, the statistic it corresponds to should be stated, e.g., standard error of the mean (SEM).
Standard error of the mean: A measure of how variable the mean will be, if you repeat the whole study many times. $$\quad SEM = SD_{\bar{x}} = \frac{SD}{\sqrt{n}} = \sqrt{\frac{\sum_i^n{(x_i - \bar{x})^2}}{n\, (n-1)}}$$
The SEM can be used to declare something like the following: "The mean of the sample was 73 mg/dL, with an SE of the mean of 3 mg/dL. This implies that the mean of the population from which the sample was randomly taken will fall, with 95% probability, in the interval of $73\pm(1.96\times3)$ mg/dL, which is from 67.12 to 78.88 mg/dL." (from reference [1])
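The quoted interval can be reproduced in a few lines. In the sketch below, `calc_sem` is a helper name introduced here for illustration, and the list of measurements is invented purely to exercise it; only the mean of 73 mg/dL and the factor $1.96\times3$ come from the quotation.

```python
import numpy as np

def calc_sem(x):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    x = np.asarray(x, dtype=float)
    return np.std(x, ddof=1) / np.sqrt(len(x))

# Reproduce the quoted interval from the reported mean and SEM
mean, sem = 73.0, 3.0
interval = mean - 1.96 * sem, mean + 1.96 * sem
print('95% ci: {:0.2f} to {:0.2f} mg/dL'.format(interval[0], interval[1]))

# Hypothetical measurements, only to exercise calc_sem
x = [70.0, 76.0, 73.0, 79.0, 67.0]
print('SEM of example data: {:0.3f}'.format(calc_sem(x)))
```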
The above formula for the $SEM$ can be rearranged for more efficient computation as follows.
$$ SEM^2 = \frac{\sum_i^n{(x_i-\bar{x})^2}}{n (n-1)}$$
$$ SEM^2 = \frac{\sum_i^n{(x_i^2 - 2x_i\bar{x} + (\bar{x})^2)}}{n (n-1)}$$
$$ SEM^2 = \frac{1}{n-1} \left[\frac{\sum_i^n x_i^2}{n} - 2\,\bar{x}\,\frac{\sum_i^n x_i}{n} + (\bar{x})^2\,\frac{\sum_i^n 1}{n}\right]$$
Note that
$$\frac{\sum_i^n x_i}{n} = \bar{x}\,,$$
and
$$\frac{\sum_i^n 1}{n} = 1\,.$$
Making these replacements, we are left with,
$$ SEM^2 = \frac{1}{n-1} \left[\frac{\sum_i^n x_i^2}{n} - 2(\bar{x})^2 + (\bar{x})^2\right]\,,$$
or simply,
$$ SEM^2 = \frac{1}{n-1} \left[\frac{\sum_i^n x_i^2}{n} - (\bar{x})^2\right]\,.$$
Written with angle brackets to represent averages this is simply, $$ SEM^2 = \frac{\langle x^2 \rangle - \langle x \rangle^2}{n-1}\,.$$
This form allows the variance (the part in square brackets) to be calculated from two sums over the numerical series: one summing $x^2$ (the first term in square brackets) and another summing $x$ to get the mean.
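A quick numerical check that the rearranged form agrees with the direct one is sketched below. The data are randomly generated stand-ins, not values from any reference.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1000)  # hypothetical data
n = len(x)

# Direct form: sum of squared differences over n (n - 1)
sem2_direct = ((x - x.mean()) ** 2).sum() / (n * (n - 1))

# Rearranged form: (<x^2> - <x>^2) / (n - 1), built from two sums
sum_x2 = np.sum(x ** 2)
sum_x = np.sum(x)
sem2_rearranged = (sum_x2 / n - (sum_x / n) ** 2) / (n - 1)

print('direct     : {:0.6e}'.format(sem2_direct))
print('rearranged : {:0.6e}'.format(sem2_rearranged))
print('agree      : {}'.format(np.isclose(sem2_direct, sem2_rearranged)))
```

One caveat with the rearranged form is that it subtracts two nearly equal numbers when the mean is large compared with the spread, so it can lose precision in floating point for such data.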
The standard error of proportion is used when describing the proportion of a group that has a certain classification. An example from reference [1] is when six of ten patients with zymurgy exhibit so-and-so. The natural interpretation is that we should expect to see so-and-so for 60% of patients with zymurgy.
The standard error of proportion provides an estimate for confidence intervals in this situation.
Standard error of proportion: $\quad SE_p = \sqrt{\frac{p (1-p)}{n}} $
Proportion estimated from sample: $\quad p$
In [7]:
import numpy as np

def calc_sep(p, n):
    # standard error of proportion for proportion p and sample size n
    return np.sqrt(p * (1 - p) / n)
p = 0.6
In [16]:
n=10
sep = calc_sep(p, n)
ci95 = 1.96 * sep
interval = p - ci95, p + ci95
print('sep : {:0.3}\n95% ci for n={}: {:0.3} to {:0.3}'.format(sep, n, interval[0], interval[1]))
So when 6 of 10 patients have a certain classification, the 95% confidence interval for the proportion in the entire population is $60\% \pm (1.96\times 15.5\%)$, or from $29.6\%$ to $90.4\%$.
In [17]:
n=100
sep = calc_sep(p, n)
ci95 = 1.96 * sep
interval = p - ci95, p + ci95
print('sep : {:0.3}\n95% ci for n={}: {:0.3} to {:0.3}'.format(sep, n, interval[0], interval[1]))
Increasing the sample size to 100 narrows the 95% confidence interval for the proportion of the population that exhibits so-and-so to between $50.4\%$ and $69.6\%$.
In [18]:
n=1000
sep = calc_sep(p, n)
ci95 = 1.96 * sep
interval = p - ci95, p + ci95
print('sep : {:0.3}\n95% ci for n={}: {:0.3} to {:0.3}'.format(sep, n, interval[0], interval[1]))
Increasing the sample size to 1000 narrows the 95% confidence interval for the proportion of the population that exhibits so-and-so to between $57\%$ and $63\%$.
If one wishes to provide a description of the sample, then the standard deviations of the relevant parameters are of interest. For instance we would provide the mean age of the patients and standard deviation, the mean size of tumors and standard deviation, etc.
If, on the other hand, one wishes to have the precision of the sample value as it relates to that of the true value in the population, then it is the standard error that should be reported. For instance, when reporting the survival probability of a sample we should provide the standard error together with this estimated probability. However, because the confidence interval is more useful and readable than the standard error, it can be provided instead as it avoids having the readers do the math.
Uncertainty in the scattering intensity should be calculated from two sources, the photon counting statistics, and the standard error of the mean of the pixel values. "The photon counting (Poisson) statistics defines the absolute minimum possible uncertainty in any counting procedure. It does not consider other contributors to noise such as the variance between pixel sensitivities or electronic noise" [1]. Additionally, we can set a lower limit on the error in scattering intensity to never have a relative uncertainty estimate lower than 1%, as beamlines report it is challenging to be more accurate than this [2].
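The text above does not say how the two sources and the 1% limit are to be combined, so the following is only a sketch under stated assumptions: the pixel values are taken to be raw (unscaled) photon counts, and the reported uncertainty is assumed to be the largest of the three estimates. The function name `intensity_uncertainty` and the pixel counts are hypothetical.

```python
import numpy as np

def intensity_uncertainty(counts, min_relative=0.01):
    """Mean intensity of one Q-bin and an uncertainty estimate (a sketch).

    counts: raw photon counts of the pixels in the bin (assumed unscaled).
    """
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    intensity = counts.mean()

    # Poisson counting statistics: sqrt(total counts), propagated to the mean
    poisson = np.sqrt(counts.sum()) / n
    # Standard error of the mean of the pixel values
    sem = np.std(counts, ddof=1) / np.sqrt(n)
    # Lower limit of 1% relative uncertainty
    floor = min_relative * intensity

    # Assumption: keep the largest of the three estimates
    return intensity, max(poisson, sem, floor)

counts = [98, 105, 110, 93, 101, 97, 104, 92]  # hypothetical pixel counts
intensity, sigma = intensity_uncertainty(counts)
print('I = {:0.1f} +/- {:0.1f}'.format(intensity, sigma))
```

Other combinations, such as adding the contributions in quadrature, would be equally easy to substitute if that is what a given data-reduction scheme prescribes.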
This leaves the choice as to what the grid spacing should be. Typically users opt for either uniform or logarithmically spaced $Q$-grids, the latter providing more points at lower $Q$ values [3].
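The difference between the two spacings can be illustrated with NumPy's `np.linspace` and `np.geomspace`. The $Q$-range, its units, and the number of bin edges below are placeholder values chosen for the illustration.

```python
import numpy as np

# Hypothetical Q-range and number of bin edges (units of 1/Angstrom assumed)
q_min, q_max, n_edges = 0.005, 0.5, 51

q_uniform = np.linspace(q_min, q_max, n_edges)  # equal steps in Q
q_log = np.geomspace(q_min, q_max, n_edges)     # equal steps in log(Q)

# Count how many edges fall in the lowest decade of Q for each grid
low_q = 10 * q_min
n_low_uniform = (q_uniform < low_q).sum()
n_low_log = (q_log < low_q).sum()
print('uniform grid edges below {:0.2f}: {}'.format(low_q, n_low_uniform))
print('log grid edges below {:0.2f}    : {}'.format(low_q, n_low_log))
```

The logarithmic grid places roughly half of its edges in the lowest decade here, whereas the uniform grid places only a handful there.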
So in SAS, the error is the standard error of the mean. Least-squares fits should be weighted by the inverse of the variance, but typically the number of values in each $Q$-bin, $N_{Q_\text{bin}}$, is not reported.
$$ y = m x + b $$
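A minimal sketch of an inverse-variance weighted straight-line fit is given below, using synthetic data with a known slope and intercept. It relies on `np.polyfit`, whose `w` argument multiplies the unsquared residuals, so passing `w = 1/sigma` weights the squared residuals by $1/\sigma^2$. The data, the `sigma` values, and the variable names are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic straight-line data with a known slope and intercept
m_true, b_true = -2.0, 5.0
x = np.linspace(0.0, 1.0, 20)
sigma = np.full_like(x, 0.05)   # hypothetical per-point standard errors
sigma[10:] = 0.2                # noisier second half
y = m_true * x + b_true + rng.normal(scale=sigma)

# w = 1/sigma gives inverse-variance weighting of the squared residuals;
# cov='unscaled' treats sigma as reliable absolute uncertainties
(m_fit, b_fit), cov = np.polyfit(x, y, 1, w=1.0 / sigma, cov='unscaled')
m_err, b_err = np.sqrt(np.diag(cov))

print('m = {:0.2f} +/- {:0.2f}'.format(m_fit, m_err))
print('b = {:0.2f} +/- {:0.2f}'.format(b_fit, b_err))
```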
Often, the standard deviation of the $Q$-values is also reported. Similar to the procedure for the scattering intensity, this can be converted to a standard error of the mean by dividing by $\sqrt{N_{Q_\text{bin}}}$,
$$ \sigma_{Q} = \dfrac{1}{\sqrt{N_{Q_\text{bin}}}} \sqrt{\dfrac{\displaystyle\sum_{Q_j\in\left[Q_k,\,Q_{k+1}\right]}\!\!\!\!\!\!\!\!\left(Q_j-\bar{Q}\right)^2}{N_{Q_\text{bin}}-1}} \,, $$
where
$$ \bar{Q} = \langle Q_j\in\left[Q_k,\,Q_{k+1}\right]\rangle\,. $$
I have never seen the $Q$ standard deviation or standard error of the mean factored into the Guinier fitting, or even plotted for that matter. What difference would this make? Should it be included?
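For reference, the per-bin $\bar{Q}$ and $\sigma_Q$ defined above could be computed along the following lines. This is only a sketch: the pixel $Q$-values are random placeholders, the bin edges are an arbitrary uniform grid, and the variable names are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical Q-value of every detector pixel and the bin edges Q_k
q_pixels = rng.uniform(0.01, 0.2, size=5000)
q_edges = np.linspace(0.01, 0.2, 11)
n_bins = len(q_edges) - 1

# Bin index of each pixel; anything on the last edge goes in the final bin
k = np.clip(np.digitize(q_pixels, q_edges) - 1, 0, n_bins - 1)

q_mean = np.zeros(n_bins)
q_sem = np.zeros(n_bins)
for j in range(n_bins):
    q_j = q_pixels[k == j]
    q_mean[j] = q_j.mean()
    # sample standard deviation divided by sqrt(N_Qbin)
    q_sem[j] = np.std(q_j, ddof=1) / np.sqrt(len(q_j))

for j in range(3):
    print('bin {}: Q = {:0.4f} +/- {:0.4f}'.format(j, q_mean[j], q_sem[j]))
```

With the values in hand, plotting them as horizontal error bars would be one way to judge whether they are large enough to matter for a fit.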